Introduction The data we used are related to health insurance that were simulated based on demographic statistics collected by the United States Census Bureau (USCB), available at “https://www.kaggle.com/mirichoi0218/insurance”.

The variables in this dataset correspond to parameters that influence the value of medical expenses charged by health insurance. For the implementation of linear regression, we selected the charges as the dependent variable since it depends on all the others. Therefore, with this work, we aim to assess the influence of age, gender, BMI, number of children, smoking status, and region of residence of each insured individual on the individual medical expenses charged by health insurance.

Variables Age: age of the beneficiary/insured;

Gender: gender of the insured (male/female);

BMI: body mass index that provides insight into the body, indicating weights that are relatively high or low compared to height. Ideally, the objective body weight index will be from 18.5 to 24.9 kg/m2, for which the ratio of height to weight is used;

Children: number of children covered by health insurance (number of dependents);

Smoker: indicator of whether they smoke (yes/no);

Region: region of the US where the beneficiary lives (northeast, southeast, southwest, northwest);

Charges: individual medical expenses charged by health insurance.

Data Importation

# Importação do ficheiro csv
insurance = read.csv(file='insurance.csv', header= T, sep=';', dec ='.') 

After importing the selected file from the aforementioned database, we factorized the age and BMI variables.

For the age variable, the following age groups were defined:

  • Young adult: Age between 17 and 30
  • Middle age: Age between 31 and 45
  • Elderly: Age 45 or older

For the BMI variable, the following divisions were made:

  • Underweight: BMI is less than 18.5
  • Normal weight: BMI is from 18.5 to 24.9
  • Overweight: BMI is from 25 to 29.9
  • Obese: BMI is 30 or more
# Fatorização da variável idade em relação às taxas
jovem_adulto = (insurance$charges[0:444])
idade_media = (insurance$charges[445:838])
idade_avancada = (insurance$charges[839:1338])

values = c(jovem_adulto,idade_media,idade_avancada)
ind = c(rep('1',length(jovem_adulto)),
      rep('2',length(idade_media)),
      rep('3',length(idade_avancada)))

idade = factor(ind)

# Fatorização da variável bmi em relação às taxas

indice_massa = insurance[with(insurance, order(insurance$bmi, insurance$charges)), ]
baixo_peso = (indice_massa$charges[0:21])
peso_normal = (indice_massa$charges[22:245])
sobrepeso = (indice_massa$charges[246:631])
obeso = (indice_massa$charges[632:1338])

values2 = c(baixo_peso,peso_normal,sobrepeso,obeso)
ind2 = c(rep('1',length(baixo_peso)),
      rep('2',length(peso_normal)),
      rep('3',length(sobrepeso)),
      rep('4',length(obeso)))

bmi = factor(ind2)

Informations

# Informação preliminar do dataset
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  18 18 18 18 18 18 18 18 18 18 ...
##  $ sex     : chr  "male" "male" "female" "female" ...
##  $ bmi     : num  33.8 34.1 26.3 38.7 35.6 ...
##  $ children: int  1 0 0 2 0 2 0 0 0 0 ...
##  $ smoker  : chr  "no" "no" "no" "no" ...
##  $ region  : chr  "southeast" "southeast" "northeast" "northeast" ...
##  $ charges : num  1726 1137 2198 3393 2211 ...
#Verificação da hipótese de existir dados em falta
any(is.na(insurance))
## [1] FALSE

The dataset does not contain missing data (NA).

This dataset contains data from 1338 individuals, distributed across 7 variables as explained earlier, with 3 of them being continuous, one discrete (number of children), and 3 categorical. Among the categorical variables, two are binary (gender and smoker), and the other has 4 levels (region). The predictive variables include age, gender, BMI, number of children, smoker, and region.

Exploratory Analysis

#install.packages("summarytools")
library(summarytools)

dfSummary(insurance,na.col=F,valid.col=F)
## Data Frame Summary  
## insurance  
## Dimensions: 1338 x 7  
## Duplicates: 1  
## 
## ---------------------------------------------------------------------------------------------
## No   Variable      Stats / Values                Freqs (% of Valid)     Graph                
## ---- ------------- ----------------------------- ---------------------- ---------------------
## 1    age           Mean (sd) : 39.2 (14)         47 distinct values     :                    
##      [integer]     min < med < max:                                     :                    
##                    18 < 39 < 64                                         : : . : : . : . : .  
##                    IQR (CV) : 24 (0.4)                                  : : : : : : : : : :  
##                                                                         : : : : : : : : : :  
## 
## 2    sex           1. female                     662 (49.5%)            IIIIIIIII            
##      [character]   2. male                       676 (50.5%)            IIIIIIIIII           
## 
## 3    bmi           Mean (sd) : 30.7 (6.1)        548 distinct values        : :              
##      [numeric]     min < med < max:                                         : :              
##                    16 < 30.4 < 53.1                                       . : : :            
##                    IQR (CV) : 8.4 (0.2)                                   : : : :            
##                                                                         . : : : : :          
## 
## 4    children      Mean (sd) : 1.1 (1.2)         0 : 574 (42.9%)        IIIIIIII             
##      [integer]     min < med < max:              1 : 324 (24.2%)        IIII                 
##                    0 < 1 < 5                     2 : 240 (17.9%)        III                  
##                    IQR (CV) : 2 (1.1)            3 : 157 (11.7%)        II                   
##                                                  4 :  25 ( 1.9%)                             
##                                                  5 :  18 ( 1.3%)                             
## 
## 5    smoker        1. no                         1064 (79.5%)           IIIIIIIIIIIIIII      
##      [character]   2. yes                         274 (20.5%)           IIII                 
## 
## 6    region        1. northeast                  324 (24.2%)            IIII                 
##      [character]   2. northwest                  325 (24.3%)            IIII                 
##                    3. southeast                  364 (27.2%)            IIIII                
##                    4. southwest                  325 (24.3%)            IIII                 
## 
## 7    charges       Mean (sd) : 13270.4 (12110)   1337 distinct values   :                    
##      [numeric]     min < med < max:                                     : .                  
##                    1121.9 < 9382 < 63770.4                              : :                  
##                    IQR (CV) : 11899.6 (0.9)                             : :                  
##                                                                         : : : : . . . .      
## ---------------------------------------------------------------------------------------------

The descriptive statistics are presented above.

Age Variable

var(insurance$age)
## [1] 197.4014
sd(insurance$age)
## [1] 14.04996
boxplot (values3~idade2, names = c('jovem_adulto','idade_media','idade_avançada'), xlab = 'Faixas etárias', ylab = 'Idades', col = c('lightpink','lightgreen','lightblue'),main = 'Distribuição das idades por faixas etárias')

Gender variable

genero = table(insurance$sex)
barplot (genero, ylim = c (0,700), names.arg = c('Feminino','Masculino'), col = c('lightpink','lightblue'), ylab = 'Número de pessoas', main = 'Número de pessoas por género' )

BMI variable

var(insurance$bmi)
## [1] 37.18788
sd(insurance$bmi)
## [1] 6.098187
boxplot (values4~bmi2, names = c('baixo_peso','peso_normal','sobrepeso','obeso'), xlab = 'Faixas de BMI', ylab = 'Valores de BMI', col = c('lightpink','lightgreen','lightblue', 'lightyellow'),main = 'Distribuição das faixas de BMI por índice de massa corporal')

Children variable

var(insurance$children)
## [1] 1.453213
sd(insurance$children)
## [1] 1.205493
boxplot(insurance$children, ylab = 'Número de filhos', main = 'Distribuição de filhos na população', col = 'lightblue')

color <- colorRampPalette(c("darkblue","lightblue"))
barplot(table (insurance$children), ylim = c(0,700), xlab = 'Número de filhos', ylab = 'Número de pessoas', main = 'Distribuição de filhos na população', col = color (6))

Smokers variable

barplot (table(insurance$smoker), ylim = c(0,1200), names = c('Não fumador','Fumador'), main = 'Distribuição de fumadores na população',col =c('lightgreen','lightblue'), ylab = 'Número de pessoas')

Region variable

regiao = table (insurance$region)
barplot (regiao, ylim = c(0,400), main = 'Distribuição das regiões',col =c('lightgreen','lightblue','lightpink','lightyellow'),ylab = 'Número de pessoas')

Charges variable

var(insurance$charges)
## [1] 146652372
sd(insurance$charges)
## [1] 12110.01
boxplot (insurance$charges, ylim= c(0,65000), main = 'Distribuição dos taxas médicos',ylab = 'Preço ($)',col = 'lightyellow')

ANOVA Analysis

Since the dependent variable we aim to study is “charges”, an analysis of this variable was conducted compared to the independent variables (gender, age, region, and BMI).

Age Variable

# Boxplot da distribuição das taxas por faixas etárias
boxplot(values~idade,names = c('jovem_adulto','idade_media','idade_avançada'), xlab = 'Faixas etárias', ylab = 'Taxas', col = c('lightpink','lightgreen','lightblue'),main = 'Distribuição das taxas por faixas etárias')

#Procede-se à execução do fligner.test para comprovar a homogeneidade das variâncias.
fligner.test(insurance$charges~idade)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  insurance$charges by idade
## Fligner-Killeen:med chi-squared = 7.3869, df = 2, p-value = 0.02489
oneway.test(insurance$charges~idade,var.equal=F)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  insurance$charges and idade
## F = 54.7, num df = 2.00, denom df = 866.79, p-value < 2.2e-16
model= aov(insurance$charges~idade)
summary(model)
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## idade          2 1.453e+10 7.267e+09   53.44 <2e-16 ***
## Residuals   1335 1.815e+11 1.360e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = insurance$charges ~ idade)
## 
## $idade
##         diff      lwr      upr     p adj
## 2-1 3249.904 1356.188 5143.619 0.0001765
## 3-1 7802.877 6018.684 9587.070 0.0000000
## 3-2 4552.973 2709.792 6396.154 0.0000000

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis that declares the homogeneity of variances.

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis, thus assuming that there are significant differences. Therefore, we proceed with Tukey’s Honestly Significant Difference (TukeyHSD) test, which is a multiple comparison test.

Outliers were observed for all factors in the boxplot of charges by age groups. The factor with a relatively higher mean concerning charges is elderly age group. Regarding the ANOVA analysis, significant differences were observed between middle age and young adult, elderly and young adult, and between elderly and middle age.

Gender variable

boxplot(charges~sex, data=insurance, col = c('lightpink','lightgreen'), main = 'Distribuição das taxas por géneros', xlab = 'Género', ylab = 'Taxas', names = c('Feminino','Masculino'))

#Procede-se à execução do fligner.test para comprovar a homogeneidade das variâncias.
fligner.test(insurance$charges~insurance$sex)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  insurance$charges by insurance$sex
## Fligner-Killeen:med chi-squared = 9.4445, df = 1, p-value = 0.002118
oneway.test(insurance$charges~insurance$sex,var.equal=F)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  insurance$charges and insurance$sex
## F = 4.4137, num df = 1.0, denom df = 1313.4, p-value = 0.03584
model= aov(insurance$charges~insurance$sex)
summary(model)
##                 Df    Sum Sq   Mean Sq F value Pr(>F)  
## insurance$sex    1 6.436e+08 643590180     4.4 0.0361 *
## Residuals     1336 1.954e+11 146280413                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = insurance$charges ~ insurance$sex)
## 
## $`insurance$sex`
##                 diff      lwr      upr     p adj
## male-female 1387.172 89.81229 2684.532 0.0361327

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis that declares the homogeneity of variances.

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis, thus assuming that there are significant differences. Therefore, we proceed with Tukey’s Honestly Significant Difference (TukeyHSD) test, which is a multiple comparison test.

Through the boxplot of charges by gender, it is evident that both factors have outliers. The ANOVA analysis reveals significant differences between the levels (Female-Male) and the dependent variable.

BMI variable

# Boxplot da distribuição das taxas por Índice de massa corporal
boxplot(values2~bmi, names = c('baixo_peso', 'peso_normal', 'sobrepeso', 'obeso'), xlab = 'Índice de massa Corporal', ylab = 'Taxas', col = c('lightpink','lightgreen','lightblue', 'lightyellow'),main = 'Distribuição das taxas por índice de massa corporal')

#Procede-se à execução do fligner.test para comprovar a homogeneidade das variâncias.
fligner.test(indice_massa$charges~bmi)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  indice_massa$charges by bmi
## Fligner-Killeen:med chi-squared = 34.434, df = 3, p-value = 1.604e-07
oneway.test(indice_massa$charges~bmi,var.equal=F)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  indice_massa$charges and bmi
## F = 20.341, num df = 3.000, denom df = 97.562, p-value = 2.556e-10
model= aov(indice_massa$charges~bmi)
summary(model)
##               Df    Sum Sq   Mean Sq F value  Pr(>F)    
## bmi            3 7.941e+09 2.647e+09   18.77 6.3e-12 ***
## Residuals   1334 1.881e+11 1.410e+08                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = indice_massa$charges ~ bmi)
## 
## $bmi
##          diff        lwr       upr     p adj
## 2-1 1776.9104 -5194.5001  8748.321 0.9135832
## 3-1 2329.8892 -4514.9803  9174.759 0.8175344
## 4-1 6894.7148   130.4966 13658.933 0.0437917
## 3-2  552.9788 -2012.7959  3118.754 0.9453699
## 4-2 5117.8044  2775.6668  7459.942 0.0000001
## 4-3 4564.8256  2631.6202  6498.031 0.0000000

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis that declares the homogeneity of variances.

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis, thus assuming that there are significant differences. Therefore, we proceed with Tukey’s Honestly Significant Difference (TukeyHSD) test, which is a multiple comparison test.

According to the analysis of the boxplot of charges by BMI, outliers are only present for “obesity”, which also has the highest mean regarding charges. Through the ANOVA results analysis, significant differences are observed only between obesity and the other factors (underweight, normal weight, and overweight). Thus, the other factors do not show differences between them.

Children variable

boxplot(charges~children, data=insurance, col = color (6), main = 'Distribuição das taxas por número de filhos', xlab = 'Número de filhos', ylab = 'Taxas')

#Procede-se à execução do fligner.test para comprovar a homogeneidade das variâncias.
children = factor (insurance$children)
fligner.test(insurance$charges~children)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  insurance$charges by children
## Fligner-Killeen:med chi-squared = 22.198, df = 5, p-value = 0.0004801
oneway.test(insurance$charges~children,var.equal=F)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  insurance$charges and children
## F = 6.8999, num df = 5.00, denom df = 122.04, p-value = 1.051e-05
model= aov(insurance$charges~children)
summary(model)
##               Df    Sum Sq   Mean Sq F value  Pr(>F)   
## children       5 2.397e+09 479383343   3.297 0.00579 **
## Residuals   1332 1.937e+11 145403382                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = insurance$charges ~ children)
## 
## $children
##           diff          lwr      upr     p adj
## 1-0   365.1962  -2026.08480 2756.477 0.9980122
## 2-0  2707.5881     62.30881 5352.867 0.0412937
## 3-0  2989.3428   -110.03004 6088.716 0.0660149
## 4-0  1484.6807  -5546.17786 8515.539 0.9908612
## 5-0 -3579.9404 -11817.32899 4657.448 0.8169344
## 2-1  2342.3919   -588.38208 5273.166 0.2025346
## 3-1  2624.1465   -722.20151 5970.495 0.2210585
## 4-1  1119.4845  -6023.68749 8262.656 0.9977503
## 5-1 -3945.1366 -12278.59355 4388.320 0.7562079
## 3-2   281.7546  -3250.57080 3814.080 0.9999166
## 4-2 -1222.9074  -8455.07055 6009.256 0.9967694
## 5-2 -6287.5285 -14697.39071 2122.334 0.2705133
## 4-3 -1504.6621  -8914.97868 5905.655 0.9923749
## 5-3 -6569.2831 -15132.83330 1994.267 0.2433597
## 5-4 -5064.6211 -15702.34884 5573.107 0.7517252

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis that declares the homogeneity of variances.

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis, thus assuming that there are significant differences. Therefore, we proceed with Tukey’s Honestly Significant Difference (TukeyHSD) test, which is a multiple comparison test.

According to the analysis of the boxplot of charges by the number of children variable, outliers are present. The factor 4 (number of children) has the highest mean regarding charges. Through the ANOVA results analysis, significant differences are observed between 0 and 2 children.

Smoker variable

boxplot(charges~smoker, data=insurance, col = c('lightpink','lightgreen'), main = 'Distribuição das taxas por fumadores', xlab = 'Fumadores', ylab = 'Taxas', names = c('Não','Sim'))

#Procede-se à execução do fligner.test para comprovar a homogeneidade das variâncias.
fligner.test(insurance$charges~insurance$smoker)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  insurance$charges by insurance$smoker
## Fligner-Killeen:med chi-squared = 238.15, df = 1, p-value < 2.2e-16
oneway.test(insurance$charges~insurance$smoker,var.equal = F)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  insurance$charges and insurance$smoker
## F = 1072.7, num df = 1.00, denom df = 311.85, p-value < 2.2e-16
oneway.test(insurance$charges~insurance$smoker,var.equal = T)
## 
##  One-way analysis of means
## 
## data:  insurance$charges and insurance$smoker
## F = 2177.6, num df = 1, denom df = 1336, p-value < 2.2e-16
model= aov(insurance$charges~insurance$smoker)
summary(model)
##                    Df    Sum Sq   Mean Sq F value Pr(>F)    
## insurance$smoker    1 1.215e+11 1.215e+11    2178 <2e-16 ***
## Residuals        1336 7.455e+10 5.580e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = insurance$charges ~ insurance$smoker)
## 
## $`insurance$smoker`
##            diff      lwr      upr p adj
## yes-no 23615.96 22623.17 24608.75     0

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis that declares the homogeneity of variances.

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis, thus assuming that there are significant differences. Therefore, we proceed with Tukey’s Honestly Significant Difference (TukeyHSD) test, which is a multiple comparison test.

According to the analysis of the boxplot of charges by the smoker variable, outliers are only present for “no”, which also has the lowest mean regarding charges. Through the ANOVA results analysis, significant differences are observed between the levels (No/Yes).

Region variable

boxplot(charges~region, data=insurance, col = c('lightpink','lightgreen', 'lightblue', 'lightyellow'), main = 'Distribuição das taxas por regiões', xlab = 'Região', ylab = 'Taxas')

#Procede-se à execução do fligner.test e shapiro.test para comprovar a homogeneidade das variâncias
fligner.test(insurance$charges~insurance$region)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  insurance$charges by insurance$region
## Fligner-Killeen:med chi-squared = 19.233, df = 3, p-value = 0.0002447
oneway.test(insurance$charges~insurance$region,var.equal=F)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  insurance$charges and insurance$region
## F = 2.6081, num df = 3.00, denom df = 740.96, p-value = 0.05059
model= aov(insurance$charges~insurance$region)
summary(model)
##                    Df    Sum Sq   Mean Sq F value Pr(>F)  
## insurance$region    3 1.301e+09 433586560    2.97 0.0309 *
## Residuals        1334 1.948e+11 146007093                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = insurance$charges ~ insurance$region)
## 
## $`insurance$region`
##                           diff         lwr        upr     p adj
## northwest-northeast  -988.8091 -3428.93434 1451.31605 0.7245243
## southeast-northeast  1329.0269 -1044.94167 3702.99551 0.4745046
## southwest-northeast -1059.4471 -3499.57234 1380.67806 0.6792086
## southeast-northwest  2317.8361   -54.19944 4689.87157 0.0582938
## southwest-northwest   -70.6380 -2508.88256 2367.60656 0.9998516
## southwest-southeast -2388.4741 -4760.50957  -16.43855 0.0476896

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis that declares the homogeneity of variances.

For an alpha of 0.05, since the p-value is less than alpha, we reject the null hypothesis, thus assuming that there are significant differences. Therefore, we proceed with Tukey’s Honestly Significant Difference (TukeyHSD) test, which is a multiple comparison test.

Regarding the region variable, outliers are present in all factors. It is noteworthy that the factor with the highest mean regarding charges is “northeast”. Considering the results from the ANOVA analysis, significant differences are observed only between southwest and southeast. Therefore, the other factors do not show differences between them.

Correlations

Based on the Pearson correlation coefficient, the correlation between the quantitative variables was performed, which in this case are also continuous.

# Criação de dataset com apenas variáveis numéricas
insurance_num <- insurance
insurance_num$sex<- NULL
insurance_num$region<- NULL
insurance_num$smoker<- NULL

# Correlações
#install.packages('ggplot2')
library (ggplot2)
library (GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(insurance_num, geom="tile", label= T, label_alpha=F, label_round=2)

Therefore, only correlations with values greater than 0.5 were considered relevant for the case study, assuming them as strong and positive correlations (1 is considered a perfect linear relationship).

Since none of the correlations obtained a value greater than 0.5, we can verify that none of the variables has a strong correlation with the insurance charges. However, “charges” has a weak positive correlation with “age”, “bmi”, and “children”, with the highest correlation being between “charges” and “age” (0.3), which is logical and expected.

Linear Regression (Full Model)

full.model<- lm(charges~., data=insurance) 
summary(full.model)
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16
#Teste a multicolinearidade   
car::vif(full.model)
##              GVIF Df GVIF^(1/(2*Df))
## age      1.016822  1        1.008376
## sex      1.008900  1        1.004440
## bmi      1.106630  1        1.051965
## children 1.004011  1        1.002003
## smoker   1.012074  1        1.006019
## region   1.098893  3        1.015841

Diagnostic measures: The p-value and F-statistics have very low values, indicating that at least one predictor variable is related to the charges.

R2 -> 1, the value tends to 1, indicating that the predictor variables are well fitted in the model.

RSE -> 0, values below 0 produce lower model errors.

The value of RSE is 6062 and R2 is 75%. The intercept is -11938.5 and almost all predictor variables, except gender and the northwest region, are significant according to their p-values. The interpretation of categorical variables, e.g., “smoker”, can be interpreted as “average charges increase by 23848.5 if the individual is a smoker - with all other variables held constant”. The coefficient value, when significant, is the average change in charges with a one-unit increase in the predictor variable - with the others held constant. While correlation measures the strength of the relationship, the coefficient quantifies the relationship and allows predictions from an equation. For example, for each unit added to age, the expected average charge is 256.9 higher - after controlling for other variables.

library (MASS)

step.model <- stepAIC(full.model, direction="both", trace=FALSE)
summary(step.model)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16
AIC(step.model)
## [1] 27113.66
step.modelf <- stepAIC(full.model, direction='forward', trace=FALSE)
summary(step.modelf)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16
AIC(step.modelf)
## [1] 27115.51

Plots

The Residuals vs Fitted plot linearly relates the residuals and the predictor variable.

The Normal Q-Q plot visually assesses whether the residuals follow a normal distribution. Later, a Shapiro test was performed to verify normality.

The Scale-Location plot checks the homogeneity of residuals and the constant variance regression criterion.

The Residuals vs Leverage plot identifies the values that most influence the regression.

step.modelb <- stepAIC(full.model, direction='backward', trace=FALSE)
summary(step.modelb)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16
AIC(step.modelb)
## [1] 27113.66
plot(step.modelb)

No gráfico Residuals vs Fitted, the non-horizontal red line may indicate a non-linear relationship.

In the Normal Q-Q plot, we see that the residuals are not exactly on the straight line, indicating that they are not normally distributed.

In the Scale-Location plot, the non-straight line indicates heteroscedasticity (different variances for all observations).

The contradictory assumptions obtained indicate that this model does not allow for reliable conclusions.

Finally, the stepwise selection method was performed from the full model fit. For this, the ‘backward’, ‘forward’, and ‘both’ direction methods were used. Then, a comparison of the AIC values for the three aforementioned methods was conducted, and the ‘backward’ method was chosen because it has the lowest AIC value.

anova(full.model)     
## Analysis of Variance Table
## 
## Response: charges
##             Df     Sum Sq    Mean Sq   F value    Pr(>F)    
## age          1 1.7530e+10 1.7530e+10  477.0239 < 2.2e-16 ***
## sex          1 7.9167e+08 7.9167e+08   21.5425 3.803e-06 ***
## bmi          1 5.2576e+09 5.2576e+09  143.0687 < 2.2e-16 ***
## children     1 5.5111e+08 5.5111e+08   14.9966 0.0001129 ***
## smoker       1 1.2287e+11 1.2287e+11 3343.5022 < 2.2e-16 ***
## region       3 2.3343e+08 7.7810e+07    2.1173 0.0962211 .  
## Residuals 1329 4.8840e+10 3.6749e+07                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(step.modelf)
## Analysis of Variance Table
## 
## Response: charges
##             Df     Sum Sq    Mean Sq   F value    Pr(>F)    
## age          1 1.7530e+10 1.7530e+10  477.0239 < 2.2e-16 ***
## sex          1 7.9167e+08 7.9167e+08   21.5425 3.803e-06 ***
## bmi          1 5.2576e+09 5.2576e+09  143.0687 < 2.2e-16 ***
## children     1 5.5111e+08 5.5111e+08   14.9966 0.0001129 ***
## smoker       1 1.2287e+11 1.2287e+11 3343.5022 < 2.2e-16 ***
## region       3 2.3343e+08 7.7810e+07    2.1173 0.0962211 .  
## Residuals 1329 4.8840e+10 3.6749e+07                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(step.modelb)
## Analysis of Variance Table
## 
## Response: charges
##             Df     Sum Sq    Mean Sq   F value    Pr(>F)    
## age          1 1.7530e+10 1.7530e+10  477.3270 < 2.2e-16 ***
## bmi          1 5.4464e+09 5.4464e+09  148.3005 < 2.2e-16 ***
## children     1 5.7152e+08 5.7152e+08   15.5618 8.402e-05 ***
## smoker       1 1.2345e+11 1.2345e+11 3361.3366 < 2.2e-16 ***
## region       3 2.3320e+08 7.7734e+07    2.1166   0.09631 .  
## Residuals 1330 4.8845e+10 3.6726e+07                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Distância de cook para visualizar se há pontos influentes na regressão do step.modelb 
n=1338
cook = cooks.distance(step.modelb)
pontInf = which(cook>4/n) # Para ver os pontos com distância de cook superior a uma unidade
pontInf
##    7   15   51   53   67   98  105  127  131  148  173  176  189  216  240  249 
##    7   15   51   53   67   98  105  127  131  148  173  176  189  216  240  249 
##  254  256  266  270  297  313  316  343  344  358  364  367  373  374  385  407 
##  254  256  266  270  297  313  316  343  344  358  364  367  373  374  385  407 
##  410  417  435  459  483  497  498  513  517  527  536  561  562  615  626  669 
##  410  417  435  459  483  497  498  513  517  527  536  561  562  615  626  669 
##  681  683  695  768  773  793  794  803  811  830  838  849  880  914  946  947 
##  681  683  695  768  773  793  794  803  811  830  838  849  880  914  946  947 
##  962  963 1015 1017 1026 1032 1036 1049 1051 1056 1063 1071 1080 1087 1103 1121 
##  962  963 1015 1017 1026 1032 1036 1049 1051 1056 1063 1071 1080 1087 1103 1121 
## 1126 1160 1223 1225 1229 1247 1257 1261 1266 1275 1280 1317 
## 1126 1160 1223 1225 1229 1247 1257 1261 1266 1275 1280 1317
summary(cook)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.000e-09 2.070e-05 1.120e-04 8.514e-04 1.024e-03 2.029e-02

A Cook’s distance is one method used to identify influential points in regression analysis. If the values of the distances are greater than 4/n, where n is the sample size, it indicates that these values exist and should be considered.

The Cook’s distance was used to check for the existence of points that may influence the regression. To determine if the distance is large enough to consider influential values, it must have a value greater than 4/n, which in this case is 0.003. Although on average the values of the distances are less than 0.003, we can see, for example, that the maximum value is much higher than the reference value, proving that there is at least one value influencing the results.

n= 1338
cook2 = cooks.distance(step.modelf)
pontInf = which(cook2>4/n) # Para ver os pontos com distância de cook superior a 1 unidade

summary(cook2)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 2.098e-05 1.138e-04 8.415e-04 1.014e-03 1.986e-02
#leverage : hj>2(p+1)/n para que na observação possam ser considerados outliers
n = 1338
p = 7
  
limhj = (2*(p+1))/n
limhj
## [1] 0.01195815

As we obtained a value greater than 0, it means that outliers are being considered for the regressions. Given that we observe the existence of outliers, these may be influencing other results such as correlation.

# Consideração de outliers no comportamento do step.modelb
hj = hatvalues(step.modelb)
out = which(hj>limhj)
out
##   67   72  142  156  176  216  285  387  507  615  669  670  781  816 1023 
##   67   72  142  156  176  216  285  387  507  615  669  670  781  816 1023
summary(hj)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003061 0.004523 0.005545 0.005979 0.007062 0.017189

From these results, we can see which outliers are influencing the regression in the step.modelb.

# Consideração de outliers no comportamento do step.modelf
hj2=hatvalues(step.modelf)
out=which(hj2>limhj)
out
##   67   72  142  156  176  216  244  285  387  448  507  514  577  615  664  665 
##   67   72  142  156  176  216  244  285  387  448  507  514  577  615  664  665 
##  669  670  721  781  816  846  861 1023 1080 1330 
##  669  670  721  781  816  846  861 1023 1080 1330
summary(hj2)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003824 0.005261 0.006290 0.006726 0.007850 0.018123

From these results, we can see which outliers are influencing the regression in the step.modelf.

Conclusions

Residuals Conditions Verification

shapiro.test(residuals(step.modelb))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(step.modelb)
## W = 0.89909, p-value < 2.2e-16
#shapiro.test(residuals(cook))
t.test(residuals(step.modelb))
## 
##  One Sample t-test
## 
## data:  residuals(step.modelb)
## t = 1.0883e-15, df = 1337, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -324.1596  324.1596
## sample estimates:
##    mean of x 
## 1.798269e-13
sigma(step.modelb)/mean(insurance$charges)
## [1] 0.456668
plot(step.modelb)

plot(step.modelb$fitted.values, step.modelb$residuals,ylab="resíduos",xlab="previstos")
abline(h=0,lty=2,col=2)

Based on the alpha value being 0.05, since the p-value < alpha, we reject the null hypothesis that declares the normality of residuals. With a p-value = 1, we can conclude that the mean is 0, which aligns with one of the assumptions of ANOVA.

boxplot(step.modelb$residuals,col= 'lightgreen', ylim = c(-15000,35000))

boxplot.stats(step.modelb$residuals)
## $stats
##        626        694        735        495        838 
## -9092.9812 -2837.7950  -979.7228  1361.9517  7641.1161 
## 
## $n
## [1] 1338
## 
## $conf
## [1] -1161.1290  -798.3166
## 
## $out
##          7          9         10         15         18         28         39 
##  18510.707  -9479.257   8232.497 -10446.745   9568.061   8407.782   7943.774 
##         41         51         53         63         66         93         98 
##   9050.296 -10026.986  11957.905  -9523.572   8574.914  -9273.671  19940.302 
##        105        126        127        129        130        131        136 
##  20210.750  -9215.459  19770.789   9519.172   8829.309  15416.055   7927.626 
##        138        148        165        173        179        189        216 
##  -9514.438  23176.632   8025.919  13709.305  -9757.760  20948.264   9754.756 
##        240        244        245        248        249        254        256 
##  21689.333   8195.297  12058.754  10683.320  19322.338  24078.626  16390.355 
##        266        269        270        276        283        289        292 
##   9663.772   9237.581 -10390.904  -9467.341   8504.305   9449.515 -10272.180 
##        293        297        313        316        334        336        343 
##  13428.724  19396.736  18044.334 -10844.171  -9613.907  -9554.128 -10807.594 
##        344        352        354        357        358        360        364 
##  12489.813  -9375.536  -9773.335  11014.077 -10223.873  12509.867  20310.684 
##        367        373        374        385        394        398        399 
##  -9888.664  15669.645  19373.048  13366.889  12028.768  14303.257  12090.666 
##        406        407        410        417        421        433        435 
## -10482.763  13274.838 -10913.229 -10582.782  -9597.055   7762.998 -10023.764 
##        439        442        444        459        465        466        483 
## -10576.323  10653.680  -9871.992  25382.867  -9848.007  -9523.075  13943.730 
##        485        486        497        498        510        511        513 
##  -9587.265  -9944.655  -9666.040  18157.414  -9978.286 -10818.821  23128.671 
##        517        527        528        536        542        554        561 
## -10965.789  22087.079 -10093.626  13007.301 -10122.613 -10199.702  14060.678 
##        562        568        581        586        609        613        615 
## -10137.637  -9744.146 -10132.397 -10289.095  10475.040  -9721.196   8649.263 
##        620        623        626        652        654        656        669 
##  13770.020 -10186.331 -11367.181 -10061.132  -9885.241 -10662.811 -10455.761 
##        672        673        681        683        684        695        729 
##  14068.079 -10015.062 -10120.609 -11091.218  -9791.223  16038.188 -10588.767 
##        730        731        736        752        754        762        764 
##  -9946.512 -10433.716  -9695.454 -10661.519   7670.791   8355.364   7873.530 
##        768        769        773        774        778        779        793 
## -10835.506  -9874.389 -10860.249  -9977.042  -9449.399 -10157.174 -10885.504 
##        794        797        803        811        830        832        838 
##  13877.051   8269.353  21772.769 -10661.965  19281.879   8623.973  29935.533 
##        844        849        863        872        876        880        884 
##   7752.522  16384.764 -10123.028  -9990.787   9498.524  18472.082  -9249.045 
##        886        889        892        895        897        900        911 
##  -9844.663  -9635.508  -9170.831  -9347.425  -9610.292 -10192.917  13411.785 
##        914        916        921        936        938        946        947 
##  15583.240  11822.622  -9143.649  -9978.541  -9790.762  -9829.981  16845.921 
##        954        962        963        972        989       1004       1005 
##   9076.598  20239.054  15532.302  12651.269 -10034.181   8067.725  -9978.441 
##       1015       1017       1020       1026       1028       1030       1032 
##  -9998.844  15290.379  -9728.717  18794.274  12046.785  11216.939  18360.795 
##       1034       1036       1037       1041       1049       1051       1056 
##  -9615.356  22062.375   7990.242  -9409.694  -9673.152  16133.936  15669.152 
##       1063       1071       1074       1076       1078       1080       1087 
## -10313.543  14243.689  12775.645   7664.944  -9305.114  23026.094  14533.842 
##       1103       1107       1121       1126       1132       1157       1160 
##  24425.453   8496.639  14076.049 -10580.262  -9290.259   8473.845  11973.828 
##       1180       1208       1209       1222       1223       1225       1227 
##   8154.065   7963.747  13308.487   8589.906  21964.149  17096.080   8355.571 
##       1229       1243       1244       1247       1252       1255       1257 
##  17147.735   7789.203   8585.316  15177.558   8199.265   7967.535  13368.624 
##       1261       1266       1268       1275       1280       1296       1298 
##  11566.051  20743.565   8073.743  12964.297  14720.164   8170.784   8980.802 
##       1301       1317       1318       1320       1326       1337 
##   8375.179  17223.178   8408.754   8673.634   8157.324   8853.505

In conclusion, we can observe that:

Regarding exploratory analysis, the mean and median in the age variable present very similar values, and ages are distributed similarly without outliers detected. The youngest patient is 18 years old, and the oldest is 64.

There are slightly more men than women, with this difference being around 1%.

In BMI, there is an asymmetric distribution of results as patients fit into different categories. The mean and median have very close values, almost identical. Some outliers can be detected in all groups, with a greater presence in the obese group, as expected since this group has the most samples. The minimum BMI in the sample is 16, and the maximum is 53.1.

In the case of children, the mean and median have very close values, but the distribution of values is highly disproportional, with about 43% having no children. Among patients who have children, most (24%) have 1 child, with the maximum in this sample being 5 children (1%).

There is a significant difference between smoking and non-smoking patients regarding their distribution (20% and 80%, respectively).

Regarding the regions variable, the values are distributed similarly except for the South East group, which has a slightly higher value (27%).

For the charges variable, we can see that the mean and median are very different from each other, showing a high skewness in the sample. From the boxplot, several outliers can be detected. The minimum value is 1121.9, and the maximum is 63770.4.

Through the ANOVA test, it was possible to see if there were significant differences in prices within each of the variables. It was demonstrated that, for example, being a smoker or being older causes an increase in the insurance value since they are presumably people who need greater medical attention due to a weaker health condition. The results obtained in this test were in line with those of the t-test performed at the end.

We also observed that when correlating quantitative variables with the charges variable, there were not very significant results except for one, which, although weak, was relatively close to 0.5 (threshold used to consider a strong correlation), which was age. This was an expected result because, as mentioned before, older people need much more medical attention than younger patients.

Finally, by examining the graphs obtained from the linear regression analysis and the Shapiro test, we saw that we did not obtain normality in our results. This was confirmed by checking the Cook’s distance and observing the information from the residuals boxplot, which showed the presence of several values that are significantly influencing the results. These outliers may have a significant impact, affecting the correlations previously analyzed considerably.

In addition to the outliers that are causing changes in the final results, some variables such as smoker and children present highly skewed distributions of values, which also influence the impact of these during the analysis.

We were able to verify that smoking is a significant indicator for the increase in insurance prices, caused by the need for greater medical attention, and we also observed that insurance prices do not increase gradually with the number of children. From the boxplot made of prices based on the number of children, we see that having 3 children covered by insurance seemed to be cheaper than having only 2 children, and having 5 children seems to lead to a smaller price increase than having 4. This can be explained by the unequal number of values being analyzed for each group, as we have, for example, 18 patients with 5 children and 324 with only 1.